Image Deblurring using DeblurGAN
DSAN 6500 Project
1. Introduction
Image deblurring is a critical task in low-level computer vision that aims to restore sharp, high-quality images from those corrupted by motion blur or defocus blur. These distortions commonly arise in dynamic environments, where camera shake, object motion, or slow shutter speeds can severely degrade image quality. Restoring such images is essential not only for aesthetic purposes but also for the performance of downstream applications such as object detection, scene understanding, and autonomous navigation.
Traditional deblurring methods rely on explicit physical models of blur formation and handcrafted priors, such as total variation or edge-preserving smoothness constraints. However, these approaches often struggle to generalize, especially in the presence of complex, spatially varying blur.
In recent years, Generative Adversarial Networks (GANs) have emerged as a powerful paradigm for image restoration tasks. These models often outperform conventional methods by learning rich priors directly from data, avoiding the need for handcrafted priors and explicit blur kernel estimation.
This project presents a DeblurGAN-style architecture trained on GoPro data, leveraging both Wasserstein loss and VGG-based perceptual loss to preserve high-level semantic features. The generator learns to map blurred images to their sharp counterparts, while the discriminator enforces photorealistic fidelity at the patch level using a PatchGAN approach.
2. Dataset
This project used the GoPro_large dataset, which contains paired blurred and sharp images curated for training and evaluating image deblurring models. The dataset contains 2,103 training pairs and 1,111 testing pairs, with each pair consisting of a blurred image and its corresponding ground-truth sharp image. All images are originally in high resolution (1280x720 pixels) and are organized into separate folders for blurred and sharp samples.
To make the dataset suitable for training, all images were resized to 256x256 pixels. Additionally, the pixel values were normalized to the range [-1,1].
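The normalization step can be sketched as follows. The text does not give the exact formula, so the common mapping `x / 127.5 - 1` for 8-bit images is an assumption here, and the function names `normalize` and `denormalize` are hypothetical. Resizing is left out because it depends on the image library used.

```python
import numpy as np

def normalize(image_uint8):
    """Map 8-bit pixel values in [0, 255] to floats in [-1, 1].

    Assumed formula: x / 127.5 - 1 (not stated in the report).
    """
    return image_uint8.astype(np.float32) / 127.5 - 1.0

def denormalize(image_float):
    """Invert normalize(): map [-1, 1] back to 8-bit values in [0, 255]."""
    scaled = (np.clip(image_float, -1.0, 1.0) + 1.0) * 127.5
    return np.round(scaled).astype(np.uint8)
```

The inverse mapping is useful when saving generator outputs, since the tanh output layer also produces values in [-1,1].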
3. Model Architecture
The proposed solution adopts a GAN framework for image deblurring, consisting of two main components: a Generator and a Discriminator. The generator is responsible for restoring a sharp image from a blurred input, while the discriminator attempts to distinguish between real sharp images and those generated by the model.
3.1. Generator
The generator takes a blurred image as input and outputs a deblurred version of the same resolution. Key features of the generator include:
- Reflection Padding: Applied before convolutions to preserve edge information and avoid artifacts introduced by zero-padding.
- Downsampling Layers: Two convolutional layers with stride 2 are used to progressively reduce the spatial dimensions while increasing the depth of feature maps.
- Residual Blocks: A sequence of 9 residual blocks allows the network to learn complex transformations while preserving spatial coherence. Each block consists of two convolutional layers; each convolution is preceded by reflection padding and followed by batch normalization and ReLU activation.
- Upsampling Layers: Two stages of nearest-neighbor upsampling are followed by 3x3 convolutions to restore image resolution.
- Output Layer: A final convolution with tanh activation maps the output to the range [-1,1]. A residual connection then adds the input to this output. Because the sum of two values in [-1,1] lies in [-2,2], the result is rescaled back to [-1,1] using a Lambda layer.
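To make the flow through the generator concrete, the sketch below traces the spatial size and feature depth at each stage for a 256x256 input. The stage counts (two downsampling layers, 9 residual blocks, two upsampling stages) come from the list above. The base filter count of 64, the doubling and halving of channels, and the use of halving for the final rescale are assumptions, and both function names are hypothetical.

```python
def trace_generator_shapes(size=256, base_filters=64, n_down=2, n_res=9):
    """Return (stage name, height/width, channels) for each generator stage.

    base_filters=64 and the channel doubling/halving are assumptions;
    the stage counts (2 down, 9 residual, 2 up) come from the text.
    """
    stages = [("input", size, 3)]
    channels = base_filters
    stages.append(("initial conv", size, channels))
    for i in range(n_down):
        size //= 2        # stride-2 convolution halves height and width
        channels *= 2     # assumed: feature depth doubles
        stages.append((f"down {i + 1}", size, channels))
    for i in range(n_res):
        # residual blocks keep resolution and depth unchanged
        stages.append((f"res {i + 1}", size, channels))
    for i in range(n_down):
        size *= 2         # nearest-neighbor upsampling doubles height and width
        channels //= 2    # assumed: feature depth halves
        stages.append((f"up {i + 1}", size, channels))
    stages.append(("output conv + tanh", size, 3))
    return stages

def global_skip(blurred_pixel, tanh_pixel):
    """Add the input to the tanh output and halve the sum.

    Both values lie in [-1, 1], so their sum lies in [-2, 2]; halving
    (an assumed choice of rescaling) brings it back to [-1, 1].
    """
    return (blurred_pixel + tanh_pixel) / 2.0
```

Under these assumptions the residual blocks operate at a quarter of the input resolution per side, which keeps the nine blocks relatively cheap while the input-to-output skip lets the network learn only the correction to the blurred image.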
3.2. Discriminator
The discriminator is modeled after the PatchGAN architecture. Instead of evaluating the entire image holistically, it divides the input into small patches and classifies each patch as real or fake. This encourages the generator to produce fine-grained realistic details.
The discriminator takes either a real sharp image or a deblurred image generated by the generator as the input and processes it through five convolutional layers. The number of filters progressively increases. Each convolutional layer is followed by a LeakyReLU activation function and Batch Normalization, except for the first layer. Instead of outputting a single real/fake classification, the network produces a grid of patch-level predictions, capturing local realism across different regions of the image. This grid is then aggregated and passed through a final dense layer to produce a single scalar value representing the overall realism of the input image.
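The size of the patch-level prediction grid, and how much of the input each prediction sees, follow from the kernel sizes and strides of the convolutions. The report does not list these, so the sketch below assumes five 4x4 convolutions with strides 2, 2, 2, 1, 1 and "same" padding, a configuration commonly used for PatchGAN discriminators. The names `ASSUMED_LAYERS`, `patch_grid_size`, and `receptive_field` are hypothetical.

```python
import math

# Assumed (kernel_size, stride) per convolution; the text specifies only
# that there are five convolutional layers, not these values.
ASSUMED_LAYERS = ((4, 2), (4, 2), (4, 2), (4, 1), (4, 1))

def patch_grid_size(input_size, layers=ASSUMED_LAYERS):
    """Side length of the prediction grid, assuming 'same' padding."""
    size = input_size
    for _kernel, stride in layers:
        size = math.ceil(size / stride)
    return size

def receptive_field(layers=ASSUMED_LAYERS):
    """Input pixels (per side) that influence one patch prediction."""
    field, jump = 1, 1
    for kernel, stride in layers:
        field += (kernel - 1) * jump  # each layer widens the field
        jump *= stride                # strides compound the step between outputs
    return field
```

With these assumed settings, `patch_grid_size(256)` and `receptive_field()` give the grid side length and patch size respectively; a different layer configuration would change both.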
3.3. Loss Functions
Perceptual Loss: Perceptual loss is designed to capture high-level semantic differences between images, as opposed to pixel-wise losses like MSE which often result in blurry outputs. In this project, the perceptual loss is computed by passing both the generated (deblurred) image and the target sharp image through a pre-trained VGG16 network and extracting feature maps from an intermediate layer (block3_conv3). The loss is then calculated as the mean squared error between these feature representations. This encourages the generator to produce outputs that are not only visually sharp but also structurally and semantically similar to the target image, leading to more realistic and perceptually pleasing results.
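The computation described above reduces to a mean squared error in feature space. The sketch below is framework-agnostic: `feature_fn` stands for any function that maps an image to a feature map. In the project this role is played by the pre-trained VGG16 truncated at block3_conv3; the `toy_features` extractor here is a hypothetical stand-in (simple image gradients) used only so the example runs without downloading weights.

```python
import numpy as np

def perceptual_loss(feature_fn, generated, target):
    """Mean squared error between the feature maps of two images."""
    diff = feature_fn(generated) - feature_fn(target)
    return float(np.mean(np.square(diff)))

def toy_features(image):
    """Hypothetical stand-in for VGG16 block3_conv3 features.

    Returns horizontal and vertical gradients of an (H, W, C) image,
    cropped to a common (H-1, W-1) size and stacked along channels.
    """
    dx = image[:, 1:, :] - image[:, :-1, :]
    dy = image[1:, :, :] - image[:-1, :, :]
    return np.concatenate([dx[:-1], dy[:, :-1]], axis=-1)
```

The loss is zero when the two images produce identical features and grows as their feature maps diverge, regardless of which extractor is plugged in.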
Wasserstein Loss: Wasserstein loss addresses several shortcomings of the traditional binary cross-entropy loss used in GANs. It provides a continuous and meaningful gradient even when the discriminator performs well, avoiding vanishing gradients. This makes GAN training more stable and improves convergence. In this implementation, the discriminator is trained to output high values for real images and low values for fake (generated) images. The generator aims to generate outputs that are indistinguishable from real images in the eyes of the discriminator. In the context of Wasserstein loss, this translates to minimizing the negative of the discriminator’s output for the generated images. As the generator improves, the Wasserstein distance between the distributions of real and fake images reduces, ultimately pushing the generator to produce more realistic and sharper images. This adversarial dynamic helps guide the generator toward creating high-quality deblurred outputs.
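The two objectives described above can be written as a minimal sketch operating on arrays of discriminator scores. The function names are hypothetical, and the sketch covers only the loss values: it omits any constraint on the discriminator (such as weight clipping or a gradient penalty), since the report does not describe one.

```python
import numpy as np

def critic_loss(scores_real, scores_fake):
    """Discriminator loss: low when real scores exceed fake scores."""
    return float(np.mean(scores_fake) - np.mean(scores_real))

def generator_adversarial_loss(scores_fake):
    """Generator loss: the negative mean score of generated images."""
    return float(-np.mean(scores_fake))
```

Minimizing `critic_loss` pushes scores for real images up and scores for generated images down, while minimizing `generator_adversarial_loss` pushes the scores of generated images up, which is the adversarial dynamic the paragraph describes.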
4. Training
The model was trained for 200 epochs using a batch size of 1. Training followed a standard GAN paradigm where the generator and discriminator were optimized alternately.
Figure 1: Generator loss steadily decreases throughout training, indicating consistent improvement.
However, GAN loss values alone do not provide a complete picture of model performance, especially in tasks like image restoration where perceptual quality is crucial. Therefore, to determine the best model checkpoint, we evaluated outputs from various epochs using Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM), two standard metrics for assessing image quality.
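For reference, PSNR follows directly from the mean squared error between the sharp and deblurred images. The sketch below assumes 8-bit images, so `max_value` defaults to 255; for images kept in [-1,1] the peak-to-peak range of 2 would be used instead. SSIM relies on windowed local statistics and is usually taken from a library (scikit-image provides `structural_similarity`), so it is not reimplemented here.

```python
import numpy as np

def psnr(reference, restored, max_value=255.0):
    """Peak Signal-to-Noise Ratio in decibels; higher means closer to the reference."""
    reference = np.asarray(reference, dtype=np.float64)
    restored = np.asarray(restored, dtype=np.float64)
    mse = np.mean((reference - restored) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))
```

Because PSNR is a logarithmic function of the pixel-wise error, it rewards pixel-level accuracy, which is why it is paired with SSIM when judging structural fidelity.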
Figure 2: PSNR increases steadily over the training epochs, reflecting improved reconstruction accuracy.
Figure 3: SSIM also improves with training, capturing better structural and perceptual similarity.
From the plots, we observe that both PSNR and SSIM values peak around epoch 165, after which there is a slight decline. This suggests that continued training beyond this point yields diminishing returns and may risk overfitting. As a result, the model from epoch 165 was selected as the final checkpoint for testing.
5. Results
Figure 4
Figure 5
To assess the effectiveness of the proposed deblurring approach, we compare our results with other deblurring models. The table below summarizes the PSNR and SSIM achieved by various techniques. Our model achieves a higher PSNR than Sun et al. and Xu et al., though lower than Nah et al. Its SSIM is higher than that of Sun et al. and slightly below those of Xu et al. and Nah et al., indicating competitive but not leading reconstruction quality.
| Metric | Sun et al. | Nah et al. | Xu et al. | DeblurGAN (ours) |
|---|---|---|---|---|
| PSNR (dB) | 24.6 | 28.3 | 25.1 | 26.42 |
| SSIM | 0.84 | 0.916 | 0.89 | 0.87 |
6. Conclusion
This project explored a deep learning-based solution to motion deblurring using a Generative Adversarial Network (GAN) architecture trained on the GoPro dataset. By combining adversarial training with a perceptual loss derived from intermediate features of a pre-trained VGG16 network, the model generated sharp and visually realistic deblurred images. The generator, designed with residual blocks and reflection padding, was trained alongside a PatchGAN discriminator to jointly learn global image consistency and restore fine textures. The model achieved a Peak Signal-to-Noise Ratio (PSNR) of 26.42 dB and a Structural Similarity Index (SSIM) of 0.87, demonstrating strong reconstruction capability.
However, while the PSNR values indicate good pixel-level accuracy, the SSIM suggests that there is still room for improvement in preserving perceptual and structural similarity. One promising direction for future work is to refine the perceptual loss component by extracting features from multiple layers of the VGG16 network. This could provide a richer, more hierarchical understanding of the image content, encouraging the generator to better reconstruct both low-level textures and high-level semantics. Further improvements could involve experimenting with deeper or more expressive feature extractors like VGG19 or ResNet, which may help enhance structural fidelity and elevate SSIM scores.